Skills: Correlations and model fit

Week 3

This page demonstrates how to generate values for three different measures of model fit.

I’ll start by loading the full dataset from last week’s assignment (the one where I had added some random noise to the outcome).

library(tidyverse)
library(here)
library(knitr)

full_data <- here("week2",
                  "full-data.csv") |>
  read_csv()

Correlation

Correlation describes how well the relationship between two continuous variables can be summarized by a straight line.

R

In R, you can calculate the correlation between all pairs of variables in a data frame by using the cor() function. Since we only want the correlations between pairs of continuous variables, we’ll start by using the select() function to choose just the variables we want to include in our correlation table.

cor_mat <- full_data |>
  select(sq_feet,
         dt_dist,
         rent) |>
  cor()

cor_mat |>
  kable()
|        |   sq_feet |    dt_dist |       rent |
|:-------|----------:|-----------:|-----------:|
|sq_feet | 1.0000000 |  0.0072741 |  0.5911491 |
|dt_dist | 0.0072741 |  1.0000000 | -0.1842930 |
|rent    | 0.5911491 | -0.1842930 |  1.0000000 |
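If you only need a single pairwise correlation rather than a full matrix, cor() also accepts two vectors directly. A minimal sketch with simulated vectors (the variable names here are made up for illustration):

```r
set.seed(1)
floor_area  <- rnorm(100, mean = 1000, sd = 200)
monthly_rent <- 0.7 * floor_area + rnorm(100, sd = 100)

# cor() with two vectors returns a single number instead of a matrix
cor(floor_area, monthly_rent)
```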

Excel

In Excel, you can calculate the correlation between two variables using the =CORREL() function.

R-squared

Another way to calculate a correlation is to estimate a model with a single predictor. The square root of that model’s R-squared value will be the absolute value of the correlation between the predictor and the outcome. You can also use R-squared to describe the fit of a model with multiple predictors.
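For instance (a self-contained sketch with simulated data, since the identity holds for any single-predictor model):

```r
set.seed(123)
x <- rnorm(200)
y <- 2 * x + rnorm(200)

simple_model <- lm(y ~ x)

# For a one-predictor model, R-squared is exactly the squared correlation,
# so its square root matches the absolute value of the correlation
sqrt(summary(simple_model)$r.squared)
abs(cor(x, y))
```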

R

In R, after you estimate a model using the lm() function, you can use the summary() function to see a summary of the results.

The R-squared value will be shown as Multiple R-squared: towards the bottom of the summary.

model <- lm(rent ~ sq_feet + dt_dist + color, data = full_data)

summary(model)
## 
## Call:
## lm(formula = rent ~ sq_feet + dt_dist + color, data = full_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -446.89  -85.17   -8.46   77.36  518.58 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.255e+02  1.453e+01   15.52   <2e-16 ***
## sq_feet      6.986e-01  8.256e-03   84.61   <2e-16 ***
## dt_dist     -4.824e+01  1.775e+00  -27.17   <2e-16 ***
## colorGreen   4.737e+01  3.898e+00   12.15   <2e-16 ***
## colorRed    -1.052e+02  2.723e+00  -38.64   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 123.8 on 9995 degrees of freedom
## Multiple R-squared:  0.5063, Adjusted R-squared:  0.5061 
## F-statistic:  2563 on 4 and 9995 DF,  p-value: < 2.2e-16

You can also just return the R-squared value on its own. This is useful if you are comparing the fit of multiple models and you don’t want to be tempted to select your preferred model based on model coefficients and their associated p-values.

summary(model)$r.squared
## [1] 0.5063459
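For example, comparing two candidate models this way might look like the following (a sketch with simulated data and made-up model formulas):

```r
set.seed(42)
dat <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
dat$y <- 1 + 2 * dat$x1 + rnorm(200)

model_a <- lm(y ~ x1, data = dat)
model_b <- lm(y ~ x1 + x2, data = dat)

# Compare fit by R-squared alone, without looking at coefficient tables
summary(model_a)$r.squared
summary(model_b)$r.squared
```

Keep in mind that Multiple R-squared never decreases when you add a predictor, which is one reason summary() also reports an Adjusted R-squared that penalizes extra terms.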

Excel

In Excel, if you run the =LINEST() function to estimate a regression model, the value in the first column of the third row will be the R-squared value for the regression.

Standard error of regression

The standard error of the regression can be used to generate confidence intervals around a prediction.

R

In the output from the summary() function in R, the standard error of the regression is above the R-squared value, and labeled as Residual standard error:.

You can also pull out the standard error of the regression directly by referring to it as sigma.

summary(model)$sigma
## [1] 123.7943
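As a rough illustration of how sigma feeds into an interval around a prediction, here is a sketch with simulated data (for an exact interval, predict() with interval = "prediction" also accounts for uncertainty in the coefficients):

```r
set.seed(7)
x <- rnorm(100)
y <- 3 * x + rnorm(100)
m <- lm(y ~ x)

sigma_hat <- summary(m)$sigma
fit <- unname(predict(m, newdata = data.frame(x = 1)))

# Approximate 95% interval: prediction +/- 2 standard errors of the regression
c(lower = fit - 2 * sigma_hat, upper = fit + 2 * sigma_hat)
```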

Excel

In Excel, if you run the =LINEST() function to estimate a regression model, the value in the second column of the third row will be the standard error of the regression.